Abstract

Much work has been done refining and characterizing the receptive fields learned
by deep learning algorithms. A lot of this work has focused on the development of
Gabor-like filters learned when enforcing sparsity constraints on a natural image
dataset. Little work however has investigated how these filters might expand to the
temporal domain, namely through training on natural movies. Here we investigate
exactly this problem in established temporal deep learning algorithms as well as
a new learning paradigm suggested here, the Temporal Autoencoding Restricted
Boltzmann Machine (TARBM).

1 Introduction

In the early days of Machine Learning, feature extraction was usually approached in a task-specific
way. The complexity and high dimensionality involved in doing so in an unsupervised fashion were
seen as a major barrier and expert features were thought to yield the best results for classification and
representation tasks [1]. Recently however, a number of advances have brought the field of unsupervised
feature extraction back into the center stage of machine learning. Increases in computational
power, allowing for algorithms trained on very large datasets, together with new techniques to train
deep architectures have yielded insightful results in unsupervised feature learning even in uncurated
sets of natural images [2]. Examples of such algorithms are denoising Autoencoders (dAEs) and
Restricted Boltzmann Machines (RBMs) [3, 4, 5].

In unsupervised feature learning, it is the structure of the data that defines the features to be learnt
by a given model. In Computational Neuroscience, this link between the ensemble of natural stimuli
an organism is exposed to and the shape of the tuning functions in its sensory systems has been
a subject of great interest [6, 7]. Specifically in the field of vision neuroscience, a number of
principles have been proposed to explain the shape of tuning functions in primary visual cortex based
on the properties of natural images, for example redundancy minimization [8] and predictive coding
[9]. In recent years, it has been shown that simple unsupervised learning algorithms such as Sparse
Coding, dAEs and RBMs can also be used to learn structure from natural stimuli, independently of
labels and supervision, and that the types of structure learnt can be related back to cortical receptive
fields found in the mammalian brain [10, 11].

While most of this research in vision has focused on finding optimal filters for representing and
decoding sets of static natural images [12, 13], here we seek to understand how these optimal filters
extend to the temporal domain. We build on existing work in the field to develop the Temporal
Autoencoding Restricted Boltzmann Machine (TARBM) and show that it is able to learn high level
structure in a natural movie dataset and account for the transformation of these features over time.

2 Existing Models 

Restricted Boltzmann Machines (RBMs) [14, 15] and Autoencoders (AEs) [16, 17] have in recent
years become prominent methods of unsupervised feature learning with applications in a wide variety
of machine learning fields. As both of these models are well known and discussed at length in
many other papers, we will introduce them only briefly here.

Both models are two-layer neural networks, all-to-all connected between the layers but with no
intralayer connectivity. The models consist of a visible and a hidden layer, where the visible layer
represents the input to the model whilst the hidden layer's job is to learn a meaningful representation
of the data in some other dimensionality. We will represent the visible layer activation variables by
$v_i$, the hidden activations by $h_j$ and the vector variables by $v = \{v_i\}$ and $h = \{h_j\}$.

Autoencoders are a deterministic model with two weight matrices $w_1$ and $w_2$ representing the flow
of data from the visible-to-hidden and hidden-to-visible layers respectively (see Figure 1). AEs are
trained to perform optimal reconstruction of the visible layer, often by minimizing the mean-squared
error (MSE) in a reconstruction task. This is usually evaluated as follows: given an activation pattern
in the visible layer $v$, we evaluate the activation of the hidden layer by $h = \mathrm{sigm}(v^T w_1 + b_h)$.
These activations are then propagated back to the visible layer through $\hat{v} = \mathrm{sigm}(h w_2 + b_v)$
and the weights $w_1$ and $w_2$ are trained to minimize the distance measure between the original
and reconstructed visible layers. For example, using the squared Euclidean distance we have a cost
function of

$$\mathcal{L}(w_1, w_2, b_v, b_h \mid \{v_d\}) = \sum_d \| v_d - \hat{v}_d \|^2,$$

where we have denoted the dataset by $\{v_d\}$ and the biases of the visible and hidden layer as $b_v$ and
$b_h$ respectively. The weights can then be learned through stochastic gradient descent on the cost
function.
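
The reconstruction pass and the squared-error cost described above can be sketched in NumPy as follows. This is only an illustrative sketch, not the implementation used in this work: the function names (`ae_forward`, `ae_cost`, `ae_grads`), the batch layout (one sample per row) and the use of plain backpropagation for the gradients are assumptions made for the example.

```python
import numpy as np

def sigm(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-x))

def ae_forward(v, w1, w2, b_h, b_v):
    """Encode a batch v (n_samples, n_visible) and reconstruct it."""
    h = sigm(v @ w1 + b_h)       # hidden activations
    v_hat = sigm(h @ w2 + b_v)   # reconstructed visible layer
    return h, v_hat

def ae_cost(v, v_hat):
    """Squared Euclidean distance summed over the dataset."""
    return np.sum((v - v_hat) ** 2)

def ae_grads(v, w1, w2, b_h, b_v):
    """Gradients of the cost with respect to all parameters (backpropagation)."""
    h, v_hat = ae_forward(v, w1, w2, b_h, b_v)
    d_out = 2.0 * (v_hat - v) * v_hat * (1.0 - v_hat)  # gradient at the output pre-activation
    d_hid = (d_out @ w2.T) * h * (1.0 - h)             # gradient at the hidden pre-activation
    return {
        "w1": v.T @ d_hid,
        "w2": h.T @ d_out,
        "b_h": d_hid.sum(axis=0),
        "b_v": d_out.sum(axis=0),
    }
```

A gradient descent step then subtracts a small multiple of each gradient from the corresponding parameter, for example `w1 -= lr * grads["w1"]` with a hypothetical learning rate `lr`.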

Restricted Boltzmann Machines on the other hand are a stochastic model that assumes symmetric
connectivity between the visible and hidden layers (see Figure 1) and seeks to model the structure
of a given dataset. They are generally viewed as energy-based models, where the energy of a given
configuration of activations $\{v_i\}$ and $\{h_j\}$ is given by

$$E_{RBM}(v, h \mid w, b_v, b_h) = -v^T w h - b_v^T v - b_h^T h.$$

RBMs are usually trained through contrastive divergence, the central idea of which is to stabilize 
the transient induced by the presentation of data to the visible layer, therefore representing it in 
the hidden layer optimally. In practice this is achieved by learning the weights via the difference 
between the transient and the equilibrium correlations between visible and hidden layers. Sample 
correlations in the first presentation are taken as a proxy for the transient and correlations after n 
successive Gibbs samples are taken as a proxy for the equilibrium correlation. The weight update is 
then defined as 

$$\Delta w_{ij} \propto \langle v_i h_j \rangle_0 - \langle v_i h_j \rangle_n.$$
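
As an illustration of this update, the following NumPy sketch estimates the two correlation terms for a batch of data. It assumes binary stochastic units and a weight matrix of shape (n_visible, n_hidden); the function name `cd_n_update` and its arguments are hypothetical and chosen for the example only.

```python
import numpy as np

def sigm(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-x))

def cd_n_update(v0, w, b_v, b_h, n=1, rng=None):
    """Estimate <v h>_0 - <v h>_n for a batch v0 of shape (n_samples, n_visible)."""
    rng = np.random.default_rng() if rng is None else rng
    p_h = sigm(v0 @ w + b_h)           # hidden probabilities given the data
    pos = v0.T @ p_h / len(v0)         # correlations at the first presentation
    v = v0
    for _ in range(n):                 # n successive Gibbs samples
        h = (rng.random(p_h.shape) < p_h).astype(float)
        p_v = sigm(h @ w.T + b_v)
        v = (rng.random(p_v.shape) < p_v).astype(float)
        p_h = sigm(v @ w + b_h)
    neg = v.T @ p_h / len(v0)          # proxy for the equilibrium correlations
    return pos - neg
```

The weights would then be updated as `w += lr * cd_n_update(batch, w, b_v, b_h)` for some small, hypothetical learning rate `lr`.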

A number of auxiliary strategies have been used to improve the training process of RBMs such as
mini-batch training, free energy minimization, Parzen windows, early stopping and sparsity constraints.
In addition, RBMs can be stacked to form what is called a Deep Belief Network (DBN)
[15], where each additional RBM models the output of the previous one to form a more abstract/high
level representation.

To date, a number of RBM based models have been proposed to capture the sequential structure 
in time series data. Two of these models, the Temporal Restricted Boltzmann Machine and the 
Conditional Restricted Boltzmann Machine, are introduced below.

2.1 Temporal Restricted Boltzmann Machine (TRBM)



The Temporal Restricted Boltzmann Machine [18] is a temporal extension of the standard RBM
whereby feedforward connections are included from previous time steps between hidden layers,
from visible-to-hidden layers and from visible-to-visible layers. Learning is conducted in the same
manner as a normal RBM using contrastive divergence and it has been shown that such a model can
be used to learn non-linear system evolutions such as the dynamics of a ball bouncing in a box [18].
A more restricted version of this model, discussed in [19], can be seen in Figure 2(b) and only contains
temporal connections between the hidden layers.

If we denote by $h = \{h^0, h^1, \ldots, h^M\}$ the hidden layers and by $v = \{v^0, v^1, \ldots, v^M\}$ the visible
layers, the energy of the model is given by

$$E(h, v \mid W) = \sum_{i=0}^{M} E_{RBM}(v^i, h^i \mid w, b) - \sum_{i=1}^{M} (h^0)^T w_i h^i, \qquad (1)$$

where the weights are as given in Figure 2(b). We denoted $W = \{w, w_1, \ldots, w_M\}$, where $w$ are
the static weights and $w_1$ to $w_M$ are the delayed weights. These models have been shown to be
amenable to stacking in deep architectures in the same manner as RBMs and AEs.
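
To make the structure of this energy concrete, the sketch below evaluates it for one configuration of the layers. It assumes that row `i` of `V` and `H` holds the layer at delay `i`, that `w` has shape (n_visible, n_hidden) and that `W_delay` is a list of the M delayed weight matrices; these names and layouts are assumptions made for the example.

```python
import numpy as np

def rbm_energy(v, h, w, b_v, b_h):
    """Energy of a single RBM configuration."""
    return -(v @ w @ h) - b_v @ v - b_h @ h

def trbm_energy(V, H, w, W_delay, b_v, b_h):
    """Energy of the reduced TRBM with hidden-to-hidden temporal connections only.

    V: (M+1, n_visible) and H: (M+1, n_hidden), row i is the layer at delay i.
    W_delay: list of M matrices of shape (n_hidden, n_hidden) holding w_1 .. w_M.
    """
    static = sum(rbm_energy(V[i], H[i], w, b_v, b_h) for i in range(len(H)))
    temporal = sum(H[0] @ W_delay[i - 1] @ H[i] for i in range(1, len(H)))
    return static - temporal
```

The first sum treats every time step as an independent RBM sharing the static weights, while the second couples the current hidden layer to each delayed hidden layer.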

2.2 Conditional Restricted Boltzmann Machine (CRBM) 

The Conditional Restricted Boltzmann Machine described in [20] contains no temporal connections
from the hidden layer but includes connections from the visible layer at previous time steps to the
current hidden and visible layers. The model architecture can be seen in Figure 2(a). Again, learning
with this architecture requires only a small change to the energy function of the RBM and can be
achieved through contrastive divergence. The CRBM is likely the most successful of the temporal
RBM models to date and has been shown to both model and generate data from complex dynamical
systems such as human motion capture data and video textures [21].

3 Temporal Autoencoding Restricted Boltzmann Machines (TARBM) 

Here we present a new model, the TARBM, an extension of the Temporal RBM (with only hidden- 
to-hidden temporal connections) where a denoising Autoencoder approach is used to pretrain the 
temporal weights. We show that this approach provides a marked advantage over contrastive di- 
vergence training alone and that our model is able to outperform both the TRBM and CRBM on a 
classical temporal sequence task while yielding a deeper insight into the temporal representation of 
natural image sequence data. 

3.1 The Model 

Much of the motivation for this work is to gain insight into the typical evolution of learned hidden
layer features present in natural movie stimuli. With the CRBM this is not possible as it is unable to
explicitly model the evolution of hidden features without resorting to a deep network architecture.
We address this by using a layerwise approach, much in the same vein as that used when stacking
RBMs to form a Deep Belief Network [15], but through time. We stack a given number of RBMs
side by side in time and train the temporal connections between the hidden layers (see Figure 2(b))
to minimize the reconstruction error, in a process similar to Autoencoder training [16]. A simple
autoregressive model is used to account for the dynamics of the hidden layer, allowing us to train a
dynamic prior over the temporal evolution of the stimulus.

3.2 Training Algorithm 

We model our network as an energy-based function with interactions between the hidden layers at
different time lags. The energy of the model is given by Equation (1), as in the case of the TRBM.
The model is essentially an M-th order autoregressive RBM and can be trained through standard
contrastive divergence. The individual RBM visible-to-hidden weights $w$ are initialized through
contrastive divergence with a sparsity constraint on static samples of the dataset. After that, to ensure
that the weights representing the hidden-to-hidden connections ($w_d$) encode the dynamic structure
of the ensemble, we initialize them by pre-training in the fashion of a denoising Autoencoder. We do
this by treating the visible layer activation at time $t - d$, where $d$ is the temporal delay, as a corrupted
version of the true visible activation at time $t$. With this view, the model should learn to reconstruct
the visible layer at time $t$ by transforming the corrupted input at $t - d$ through the model as in the
case of a denoising Autoencoder. The pretraining is described in Algorithm 1.

Figure 1: Restricted Boltzmann Machine (a) and Autoencoder (b) architectures



Algorithm 1 Pre-training temporal weights through Autoencoding

for each sequence of images $I(t - d), \ldots, I(t)$, we take $v^0 = I(t), \ldots, v^d = I(t - d)$ and do
    for d = 1 to M do
        for i = 1 to d do
            $h^i = \mathrm{sigm}((v^i)^T w + b_h)$
        end for
        $h^0 = \mathrm{sigm}(b_h + \sum_{j=1}^{d} w_j h^j)$,  $\hat{v}^0 = h^0 w^T + b_v$
        $\mathrm{Error}(\hat{v}^0, v^0) = |\hat{v}^0 - v^0|^2$
        $\Delta w_d = -\eta \, \partial \mathrm{Error} / \partial w_d$
    end for
end for



One can regard the weights $w$ as a representation of the static patterns contained in the data and the
$w_d$ as representing the transformation undergone by these patterns over time in the data sequences.
This allows us to separate the representation of form and motion in the case of natural image sequences,
a desirable property that is frequently studied in natural movies (see [22]).
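
The pretraining procedure can be illustrated with the following NumPy sketch for a single image sequence. It is a minimal sketch under several assumptions rather than the implementation used here: the static weights `w` (shape (n_visible, n_hidden)) and the biases are kept fixed, the temporal weights start at zero, each delay is trained in turn with a plain gradient descent step on the reconstruction error, and all function names and default values (`lr`, `epochs`) are hypothetical.

```python
import numpy as np

def sigm(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-x))

def reconstruct(frames, w, W_delay, b_v, b_h, d=None):
    """Predict the current frame from the hidden codes of the d previous frames.

    frames: (M+1, n_visible), row i is the frame at delay i (row 0 is the current frame).
    """
    d = len(W_delay) if d is None else d
    H = sigm(frames @ w + b_h)                      # static hidden codes h^i
    drive = b_h + sum(W_delay[j - 1] @ H[j] for j in range(1, d + 1))
    h0 = sigm(drive)                                # predicted current hidden layer
    return h0 @ w.T + b_v, h0, H

def pretrain_temporal_weights(frames, w, b_v, b_h, M, lr=0.01, epochs=50):
    """Train w_1 .. w_M one delay at a time by descending the reconstruction error."""
    n_h = w.shape[1]
    W_delay = [np.zeros((n_h, n_h)) for _ in range(M)]
    for d in range(1, M + 1):
        for _ in range(epochs):
            v_hat, h0, H = reconstruct(frames, w, W_delay, b_v, b_h, d)
            err = v_hat - frames[0]
            delta = (2.0 * err @ w) * h0 * (1.0 - h0)   # gradient at the pre-activation of h^0
            W_delay[d - 1] -= lr * np.outer(delta, H[d])
    return W_delay
```

In practice the inner update would be averaged over minibatches of sequences instead of being applied to a single sequence.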

4 Experiments 

We first assess the TARBM's ability to learn multi-dimensional temporal sequences by applying
it to the 49-dimensional motion capture data described in [20] and comparing the performance to a
TRBM (footnote 1) and Graham Taylor's example CRBM implementation (footnote 2). All three models are
implemented using Theano [23], have a temporal dependency of 6 frames and were trained using minibatches of
100 samples for 500 epochs (footnotes 3 and 4). The training time for the models was approximately equal.
Training was performed on the first 2000 samples of the dataset, after which the models were presented with
1000 snippets of the data not included in the training set and required to generate the next frame in the
sequence. The results of a single trial prediction for 4 dimensions of the dataset can be seen in Figure 3
and the mean squared error of the model predictions over 100 repetitions of the task can be seen in
Table 1. The TARBM by far outperforms the TRBM model in this task and is also somewhat better
than the CRBM (footnote 5). The gain in performance from the TRBM to the TARBM model, which are
structurally identical, would suggest that our approach of Autoencoding the temporal dependencies
gives the model a more meaningful temporal representation than is achievable through contrastive
divergence alone.

Figure 2: (a) Conditional Restricted Boltzmann Machine architecture (b) Architecture used by the
Temporal RBM and the Temporal Autoencoding RBM

Figure 3: CRBM, TRBM and TARBM (panels from left to right) used to fill in data points from motion
capture data [20]. 4 dimensions of the motion data are shown along with their model reconstructions
from a single trial.

1 In this section we refer to the reduced TRBM model referenced in [19] with only hidden-to-hidden temporal
connections

2 CRBM implementation available at https://gist.github.com/2505670

3 For the TRBM, training epochs were broken up into 100 static pretraining and 400 epochs for all the
temporal weights together

4 For the TARBM, training epochs were broken up into 100 static pretraining, 50 Autoencoding epochs per
delay and 100 epochs for all the temporal weights together



5 No attempt was made to tune the CRBM beyond the code provided; as such, it is possible that better
performance could be achieved.

Table 1: Prediction results on the motion capture dataset

Model    Architecture and Training            Mean Squared Error
TRBM     100 hidden units, 6 frame delay      1.82
CRBM     100 hidden units, 6 frame delay      0.64
TARBM    100 hidden units, 6 frame delay      0.37



The second experiment was to model a natural movie dataset and investigate the types of filters
learned. Here we take the Hollywood2 dataset introduced in [24], consisting of a number of snippets
from various Hollywood films, and compare the CRBM implementation referenced above with our TARBM
model. From the dataset, 8x8 pixel patches are extracted in sequences 30 frames long. They are then
contrast normalised and whitened to provide a training set of approximately 250,000 samples. The
models, each with 400 hidden units and a temporal dependency of 3 frames, are trained initially for
100 epochs on static frames of the data to initialise the $w$ weights and then until convergence on the
full temporal sequences.
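
The preprocessing step can be sketched as follows. The text does not specify which normalisation or whitening procedure was used, so the per-patch standardisation and ZCA whitening below are assumptions chosen for illustration, as are the function names and the regularisation constants `eps`.

```python
import numpy as np

def contrast_normalise(patches, eps=1e-8):
    """Give every flattened patch (one per row) zero mean and unit standard deviation."""
    patches = patches - patches.mean(axis=1, keepdims=True)
    return patches / (patches.std(axis=1, keepdims=True) + eps)

def zca_whiten(X, eps=1e-5):
    """ZCA-whiten the rows of X; returns the whitened data and the whitening matrix."""
    X = X - X.mean(axis=0)
    cov = X.T @ X / len(X)
    eigval, eigvec = np.linalg.eigh(cov)       # covariance is symmetric
    W = eigvec @ np.diag(1.0 / np.sqrt(eigval + eps)) @ eigvec.T
    return X @ W, W
```

For 8x8 patches each row of the input would hold the 64 pixel values of one frame of a patch sequence, and the same whitening matrix would be applied to every frame.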

Visualisation of the temporal receptive fields learnt by the CRBM involves displaying the weight
matrix $w$ and the temporal weights $w_1$ to $w_d$ for each hidden unit as a projection into the visible
layer (an 8x8 patch). This shows the temporal dependence of each hidden unit on the past visible
layer activations and is plotted with time running from left to right. The visualisation process for the
TARBM is somewhat more complicated as each hidden unit is also dependent on a number of hidden
units from each delay time in the model and as such cannot be visualised as a direct projection of
the weights into the visible layer. To understand how these units depend on the past we use a forward
projection method through the temporal delays whereby a hidden unit $h$ at delay time $t - d$ is chosen
as the starting point. We then use the relative weights for unit $h$ in $w_1$ to find the $n$ most likely units
to be active at time $t - (d - 1)$ given that unit $h$ was active at $t - d$. For each of the $n$ active units
at $t - (d - 1)$, we choose $n$ active units at time $t - (d - 2)$ given the activations of unit $h$ at $t - d$
propagated through $w_2$ and one of the $n$ units at $t - (d - 1)$ propagated through $w_1$. This process
is repeated until the full delay of the network is mapped out. For each of the active hidden units, the
projection onto an 8x8 patch of the visible layer is defined in the weight matrix $w$. When plotted
for $n = 1$, this trace displays the most likely evolution of the hidden layer over the delay period of
the model for each hidden unit.
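
One possible reading of this forward projection is sketched below. It assumes that column `u` of the delayed weight matrix for lag `j` gives the input that an active unit `u` sends to every hidden unit `j` steps later, and that "most likely" means the largest summed input; the function name `forward_projection` and these conventions are assumptions made for the example.

```python
import numpy as np

def forward_projection(start_unit, W_delay, n=1):
    """Trace the n most strongly driven hidden units through the temporal delays.

    W_delay: list of d matrices (n_hidden, n_hidden); W_delay[j - 1] holds the weights for lag j.
    Returns a list of paths, each a list of d + 1 unit indices ordered from t - d to t.
    """
    d = len(W_delay)
    paths = [[start_unit]]
    for step in range(1, d + 1):
        new_paths = []
        for path in paths:
            drive = np.zeros(W_delay[0].shape[0])
            # Sum the input from every unit already on the path, each through its own lag.
            for lag in range(1, step + 1):
                drive += W_delay[lag - 1][:, path[-lag]]
            top = np.argsort(drive)[::-1][:n]       # n most strongly driven units
            new_paths.extend(path + [int(u)] for u in top)
        paths = new_paths
    return paths
```

Each unit index `u` on a path can then be displayed as an 8x8 patch by reshaping the corresponding column of the static weights, for example `w[:, u].reshape(8, 8)` when `w` has shape (64, n_hidden).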

A subset of the temporal filters learned by each of the models can be seen in Figure 4, with the
TARBM on the left and the CRBM on the right. While both the TARBM and the CRBM learn Gabor-like
filters at time $t$, their dependence on the past is markedly different. Most hidden units in the
CRBM fail to capture any structured dependence on delay times greater than $d = 1$. This makes the
CRBM's temporal filters difficult to interpret with respect to structure in the image. The layerwise
training of the temporal weights in the TARBM, along with the forced reliance on filters learned in
$w$ for its delay input, not only gives the TARBM a longer temporal dependence but also allows the
learned weights to be easily interpreted as a transformation of the learned filters.

Figure 5 shows the forward projection method of visualising the TARBM for $n = 3$ from selected
hidden units. This means that for each delay step, the three most likely filters to be active at the next
point in time are shown. The model is able to learn multiple transformations over time for each of the
hidden unit receptive fields. The transformations often represent simple operations such as rotation
and translation of the static features, separating the modeling of form and motion.



5 Discussion and Future Work 

We have shown that by using an Autoencoder to initialise the temporal weights of a TRBM, forming
what we call a TARBM, a significant performance increase can be achieved in modelling and
generating from a sequential motion capture dataset (a mean squared error of 0.37 against 1.82 for
the TRBM, see Table 1). We also show that the TARBM is able to learn
high level structure from natural movies and account for the transformation of these features over
time. Additionally, the evolution of the learned temporal filters is easily interpretable and helps to
better understand how the model represents the trained data.

The presented model could with minimal effort be adapted into a deep architecture, allowing us to
represent higher order features in the same temporal manner. We propose that learning higher order
temporal features might prove to be useful for control tasks such as image stabilization and object
tracking. In addition, we hope to study the relation of the presented encoding strategy with strategies
employed by the mammalian visual cortex [25]. Another interesting avenue of research will be to
apply the current model to classification and generative tasks.

Figure 4: The temporal features of a subset of hidden units from a TARBM (left, panel title "Filter
Dynamics for the TARBM") and a CRBM (right, panel title "Filter Dynamics for the CRBM"). For the
TARBM, we plot the most active units as described in the text ($n = 1$). Each group
of 4 images represents the temporal filter of one hidden unit with the lowest patch representing time
$t$ and the 3 patches above representing each of the delay steps in the model. Temporal filters for
the 80 units (out of 400) with highest temporal variation of the receptive fields for both models are
shown. The units are displayed in two rows of 40 columns with 4 filters, with the temporal axis
going from top to bottom.

Figure 5: Temporal filters of 3 hidden units in the TARBM after training on the Hollywood2 dataset
($n = 3$). The top image shows the schematic of the three images below it. Each patch in the top
row of an image represents the activation of a single hidden unit at time $t - 3$, where $d = 3$ is the
delay of the TARBM. The second row down shows the 3 most likely units to be activated at $t - 2$
given the activation of the unit at $t - 3$ and so on for the 3rd and 4th rows, forming a tree structure
of dependency. For ease of interpretation, units with multiple descendants are repeated so that each
column can easily be read top to bottom.